Automated identification of borrowings in multilingual wordlists


Abstract

Although lexical borrowing is an important aspect of language evolution, there have been few attempts to automate the identification of borrowings in datasets. Moreover, none of the solutions proposed so far identify borrowings across multiple languages. This study proposes a new method for this task and tests it on a newly compiled large comparative dataset of 48 South-East Asian languages from Southern China. The method yields very promising results, while being conceptually straightforward and easy to apply. This makes the approach a perfect candidate for computer-assisted exploratory studies in contact areas.


Related resources

LexStat: Automatic Detection of Cognates in Multilingual Wordlists

In this paper, a new method for automatic cognate detection in multilingual wordlists will be presented. The main idea behind the method is to combine different approaches to sequence comparison in historical linguistics and evolutionary biology into a new framework which closely models the most important aspects of the comparative method. The method is implemented as a Python program and provi...


Automated Alignment in Multilingual Corpora

Experiences in computing alignments at the paragraph and sentence level within a project TRANSLEARN in the European Union's "LRE" programme of research and development in language engineering are reported. About 98% of the sentences in pairs of corpora in different languages have been aligned correctly by a method that uses dynamic programming on numbers of characters per sentence. This paralle...


Using Sequence Similarity Networks to Identify Partial Cognates in Multilingual Wordlists

Increasing amounts of digital data in historical linguistics necessitate the development of automatic methods for the detection of cognate words across languages. Recently developed methods work well on language families with moderate time depths, but they are not capable of identifying cognate morphemes in words which are only partially related. Partial cognacy, however, is a frequently recurr...


Language Identification in Multilingual Documents

Most optical character recognition (OCR) systems can recognize at most a few languages. For large archives of document images that contain different languages, there must be some way to automatically categorize these documents before applying the proper OCR to them. This report presents research on the identification of English, Chinese, Malay and Tamil in image documents. While most other wo...


Improving Automated Alignment in Multilingual Corpora

We report on methods of improving multilingual text alignments that have been produced in a simple dynamic-programming scheme, by automated detection of possible misalignments. Details of methods involving cognates, specially identified words, and propositional contents of sentences are given, together with notable features of their performance on parallel corpora in a number of different types ...



Journal

Journal title: Open Research Europe

Year: 2022

ISSN: 2732-5121

DOI: https://doi.org/10.12688/openreseurope.13843.3